AITopics | tf-idf score

Collaborating Authors

tf-idf score

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Feature Selection Empowered BERT for Detection of Hate Speech with Vocabulary Augmentation

Desai, Pritish N., Kewalramani, Tanay, Mandal, Srimanta

arXiv.org Artificial IntelligenceDec-3-2025

Abusive speech on social media poses a persistent and evolving challenge, driven by the continuous emergence of novel slang and obfuscated terms designed to circumvent detection systems. In this work, we present a data efficient strategy for fine tuning BERT on hate speech classification by significantly reducing training set size without compromising performance. Our approach employs a TF IDF-based sample selection mechanism to retain only the most informative 75 percent of examples, thereby minimizing training overhead. To address the limitations of BERT's native vocabulary in capturing evolving hate speech terminology, we augment the tokenizer with domain-specific slang and lexical variants commonly found in abusive contexts. Experimental results on a widely used hate speech dataset demonstrate that our method achieves competitive performance while improving computational efficiency, highlighting its potential for scalable and adaptive abusive content moderation.

machine learning, natural language, training data, (17 more...)

arXiv.org Artificial Intelligence

2512.02141

Genre: Research Report > New Finding (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Efficient Zero-Shot Long Document Classification by Reducing Context Through Sentence Ranking

Kokate, Prathamesh, Sarnaik, Mitali, Khopade, Manavi, Takalikar, Mukta, Joshi, Raviraj

arXiv.org Artificial IntelligenceAug-26-2025

Transformer-based models like BERT excel at short text classification but struggle with long document classification (LDC) due to input length limitations and computational inefficiencies. In this work, we propose an efficient, zero-shot approach to LDC that leverages sentence ranking to reduce input context without altering the model architecture. Our method enables the adaptation of models trained on short texts, such as headlines, to long-form documents by selecting the most informative sentences using a TF-IDF-based ranking strategy. Using the MahaNews dataset of long Marathi news articles, we evaluate three context reduction strategies that prioritize essential content while preserving classification accuracy. Our results show that retaining only the top 50\% ranked sentences maintains performance comparable to full-document inference while reducing inference time by up to 35\%. This demonstrates that sentence ranking is a simple yet effective technique for scalable and efficient zero-shot LDC.

classification, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2508.1749

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving the Efficiency of Long Document Classification using Sentence Ranking Approach

Kokate, Prathamesh, Sarnaik, Mitali, Khopade, Manavi, Joshi, Raviraj

arXiv.org Artificial IntelligenceJun-24-2025

Long document classification poses challenges due to the computational limitations of transformer-based models, particularly BERT, which are constrained by fixed input lengths and quadratic attention complexity. Moreover, using the full document for classification is often redundant, as only a subset of sentences typically carries the necessary information. To address this, we propose a TF-IDF-based sentence ranking method that improves efficiency by selecting the most informative content. Our approach explores fixed-count and percentage-based sentence selection, along with an enhanced scoring strategy combining normalized TF-IDF scores and sentence length. Evaluated on the MahaNews LDC dataset of long Marathi news articles, the method consistently outperforms baselines such as first, last, and random sentence selection. With MahaBERT-v2, we achieve near-identical classification accuracy with just a 0.33 percent drop compared to the full-context baseline, while reducing input size by over 50 percent and inference latency by 43 percent. This demonstrates that significant context reduction is possible without sacrificing performance, making the method practical for real-world long document classification tasks.

classification, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2506.07248

Genre: Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Classification (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

On the Biased Assessment of Expert Finding Systems

Decorte, Jens-Joris, Van Hautte, Jeroen, Develder, Chris, Demeester, Thomas

arXiv.org Artificial IntelligenceOct-7-2024

In large organisations, identifying experts on a given topic is crucial in leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees' expertise based on the vast amount of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study provides an analysis of how these recommendations can impact the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints to the annotation process to prevent these biased evaluations, and show that this still allows annotation suggestions of high utility. These findings should inform benchmark creation or selection for expert finding, to guarantee meaningful comparison of methods.

annotation, knowledge area, recommendation, (15 more...)

arXiv.org Artificial Intelligence

2410.05018

Country:

North America > United States > California > San Francisco County > San Francisco (0.14)
North America > United States > New York > New York County > New York City (0.05)
Europe > Belgium > Flanders > East Flanders > Ghent (0.04)
(6 more...)

Genre: Research Report > Experimental Study (0.34)

Industry: Law (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis (0.47)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.46)

Add feedback

A diverse Multilingual News Headlines Dataset from around the World

Leeb, Felix, Schölkopf, Bernhard

arXiv.org Artificial IntelligenceMar-28-2024

Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models as well as offering a simple, accessible collection of articles, for example, to analyze global news coverage and cultural narratives. As a simple demonstration of the analyses facilitated by this dataset, we use a basic procedure using a TF-IDF weighted similarity metric to group articles into clusters about the same event. We then visualize the \emph{event signatures} of the event showing articles of which languages appear over time, revealing intuitive features based on the proximity of the event and unexpectedness of the event. The dataset is available on \href{https://www.kaggle.com/datasets/felixludos/babel-briefings}{Kaggle} and \href{https://huggingface.co/datasets/felixludos/babel-briefings}{HuggingFace} with accompanying \href{https://github.com/felixludos/babel-briefings}{GitHub} code.

dataset, diverse multilingual news headline dataset, news coverage, (10 more...)

arXiv.org Artificial Intelligence

2403.19352

Country:

North America > United States > District of Columbia > Washington (0.04)
Europe > Italy > Tuscany > Florence (0.04)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
(3 more...)

Genre: Research Report (0.64)

Industry: Media > News (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Classification (0.47)

Add feedback

ROUGE-K: Do Your Summaries Have Keywords?

Takeshita, Sotaro, Ponzetto, Simone Paolo, Eckert, Kai

arXiv.org Artificial IntelligenceMar-8-2024

Keywords, that is, content-relevant words in summaries play an important role in efficient information conveyance, making it critical to assess if system-generated summaries contain such informative words during evaluation. However, existing evaluation metrics for extreme summarization models do not pay explicit attention to keywords in summaries, leaving developers ignorant of their presence. To address this issue, we present a keyword-oriented evaluation metric, dubbed ROUGE-K, which provides a quantitative answer to the question of -- \textit{How well do summaries include keywords?} Through the lens of this keyword-aware metric, we surprisingly find that a current strong baseline model often misses essential information in their summaries. Our analysis reveals that human annotators indeed find the summaries with more keywords to be more relevant to the source documents. This is an important yet previously overlooked aspect in evaluating summarization systems. Finally, to enhance keyword inclusion, we propose four approaches for incorporating word importance into a transformer-based model and experimentally show that it enables guiding models to include more keywords while keeping the overall quality. Our code is released at https://github.com/sobamchan/rougek.

computational linguistic, keyword, rouge-k, (15 more...)

arXiv.org Artificial Intelligence

2403.05186

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Washington > King County > Seattle (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
(9 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Unsupervised hard Negative Augmentation for contrastive learning

Shu, Yuxuan, Lampos, Vasileios

arXiv.org Artificial IntelligenceJan-4-2024

We present Unsupervised hard Negative Augmentation (UNA), a method that generates synthetic negative instances based on the term frequency-inverse document frequency (TF-IDF) retrieval model. UNA uses TF-IDF scores to ascertain the perceived importance of terms in a sentence and then produces negative samples by replacing terms with respect to that. Our experiments demonstrate that models trained with UNA improve the overall performance in semantic textual similarity tasks. Additional performance gains are obtained when combining UNA with the paraphrasing augmentation. Further results show that our method is compatible with different backbone models. Ablation studies also support the choice of having a TF-IDF-driven control on negative augmentation.

augmentation, dataset, proceedings, (12 more...)

arXiv.org Artificial Intelligence

2401.02594

Country:

Europe > United Kingdom > England > Greater London > London (0.04)
Europe > Finland > Uusimaa > Helsinki (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Understanding TF-IDF in NLP: A Comprehensive Guide

#artificialintelligenceMar-21-2023, 06:10:58 GMT

Natural Language Processing (NLP) is an area of computer science that focuses on the interaction between human language and computers. One of the fundamental tasks of NLP is to extract relevant information from large volumes of unstructured data. In this article, we will explore one of the most popular techniques used in NLP called TF-IDF. TF-IDF is a numerical statistic that reflects the importance of a word in a document. It is commonly used in NLP to represent the relevance of a term to a document or a corpus of documents.

corpus, frequency, tf-idf, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

Add feedback

How to Use SVD and NMF in Python

#artificialintelligenceMar-10-2023, 17:45:38 GMT

In the context of Natural Language Processing (NLP), topic modeling is an unsupervised learning problem whose goal is to find abstract topics in a collection of documents. Topic Modeling answers the question: "Given a text corpus of many documents, can we find the abstract topics that the text is talking about?" By the end of this tutorial, you'll be able to build your own topic models to find topics in any piece of text. Let's start by understanding what topic modeling is. Suppose you're given a large text corpus containing several documents.

document-term matrix, text corpus, word cloud, (12 more...)

#artificialintelligence

Genre: Instructional Material > Course Syllabus & Notes (0.55)

Technology: Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.61)

Add feedback

Word-level Text Highlighting of Medical Texts for Telehealth Services

Ozyegen, Ozan, Kabe, Devika, Cevik, Mucahit

arXiv.org Artificial IntelligenceAug-1-2022

The medical domain is often subject to information overload. The digitization of healthcare, constant updates to online medical repositories, and increasing availability of biomedical datasets make it challenging to effectively analyze the data. This creates additional work for medical professionals who are heavily dependent on medical data to complete their research and consult their patients. This paper aims to show how different text highlighting techniques can capture relevant medical context. This would reduce the doctors' cognitive load and response time to patients by facilitating them in making faster decisions, thus improving the overall quality of online medical services. Three different word-level text highlighting methodologies are implemented and evaluated. The first method uses TF-IDF scores directly to highlight important parts of the text. The second method is a combination of TF-IDF scores and the application of Local Interpretable Model-Agnostic Explanations to classification models. The third method uses neural networks directly to make predictions on whether or not a word should be highlighted. The results of our experiments show that the neural network approach is successful in highlighting medically-relevant terms and its performance is improved as the size of the input segment increases.

dataset, information, medical chat highlighting dataset, (14 more...)

arXiv.org Artificial Intelligence

2105.104

Country:

North America > United States > New York (0.04)
North America > Canada > Ontario > Toronto (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
Europe > Finland (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Health & Medicine > Health Care Technology > Telehealth (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback